Background. The assessment of cancer treatment in oncological clinical trials is usually based on serial measurements of tumour size according to the Response Evaluation Criteria in Solid Tumours (RECIST) guidelines. The aim of our study was to evaluate the variability of target lesion measurements between readers, as well as its impact on response evaluation, workflow and reporting.

Patients and methods. Twenty oncological patients were included in the study, each with CT examinations from thorax to pelvis performed on a 64-slice CT scanner. Four readers independently defined and measured target lesions at baseline and follow-up according to the RECIST 1.1 criteria, using both PACS (Picture Archiving and Communication System) and LMS (Lesion Management Solutions, Median Technologies, Valbonne Sophia Antipolis, France). Variability between measurements obtained with PACS and LMS was assessed with the Bland-Altman approach. Intra- and inter-observer variabilities were calculated for identical lesions, and the overall response per case was determined. In addition, the time required for evaluation and reporting of each case was recorded.

Results. For single lesions, the median intra-observer variability ranged from 4.9-9.6% (mean 5.9%) and the median inter-observer variability from 4.3-11.4% (mean 7.1%), across evaluation time points, imaging systems and observers. Nevertheless, the variability in the change of the sum of longest diameters (Δ sum LD), which determines the classification of the overall response, was 24%. The overall response assessed by a single observer or by different observers was discrepant in 6.3% and 12% of cases, respectively, compared with the mean results of multiple observers. The mean case evaluation time was 286 s vs. 228 s at baseline and 267 s vs. 196 s at follow-up for PACS and LMS, respectively.
Conclusions. Uni-dimensional measurements of target lesions show low intra- and inter-observer variabilities, but the high variability in the change of Δ sum LD shows the potential for misclassification of the overall response according to the RECIST 1.1 guidelines. Nevertheless, the reproducibility of RECIST reporting can be improved when a case is assessed by a single observer and when the mean results of multiple observers are used. Case-based evaluation time was shortened by up to 27% using custom software.
Introduction

The accurate assessment of tumour size is essential for oncological clinical trials.1 Decisions on subsequent cancer treatment often depend on radiological reports about the current status of, and changes in, tumour burden.2,3 For the comparison and interpretation of oncological trial results, it is important to classify measurements of tumour burden consistently and reproducibly, independent of clinical institution and observer. Definitive guidelines for the standardization of tumour measurements and response evaluation were published in 2000 as the Response Evaluation Criteria in Solid Tumours (RECIST).4 These guidelines define the selection of target lesions in terms of number, localization, minimal tumour size and measurability. The categories of the overall response evaluation are progressive disease (PD), stable disease (SD), partial response (PR) and complete remission (CR). Besides high accuracy in quantifying tumour progression or shrinkage, it is desirable to simplify and shorten international guidelines as far as possible. In this context, the revised RECIST 1.1 guidelines were published in 2009 with, among other changes, a reduced total number of target lesions (5, formerly 10) and new standards for the measurement of, e.g., lymph nodes (minimum short axis of 1.5 cm for a target lymph node).5 However, quantitative reporting in clinical routine with measurements of multiple lesions is costly and time-consuming, although it would be desirable for every oncological patient.

The aim of our study was to evaluate the vari- 
ability of target lesion measurements by readers as 
well as the impact on overall response evaluation, 
workflow and reporting. 

Patients and methods 

Study population 

Twenty oncological patients (11 male, 9 female; mean age 60±14 years) were randomly selected from our clinical study archive. Primary tumour histology was lung cancer (NSCLC n=6, SCLC n=1), colon cancer (n=3) and urothelial cancer (n=3), as well as n=1 each for pancreatic cancer, breast cancer, endometrial cancer, teratoma, germ cell tumour and lymphoma. All patients had two contrast-enhanced CT examinations from thorax to pelvis (at baseline and follow-up), performed on a 64-slice CT scanner (Siemens, Forchheim, Germany) with intravenous contrast agent in all cases.

Image analysis 

Evaluation was performed on images reconstructed with a kernel of 30 and a slice thickness of 5 mm; both the soft-tissue window (window width, 500 HU; window level, 55 HU) and the lung window (window width, 1,500 HU; window level, -600 HU) could be applied. Uni-dimensional (1D) measurements of target lesions at baseline and follow-up were performed according to the RECIST 1.1 guidelines; non-target and new lesions were not considered. The target lesions were not preselected, so each observer individually defined appropriate lesions. Target lesions defined at baseline that were no longer visible at follow-up were excluded from statistical computations.

Four radiologists with more than 5 years of experience in oncological radiology took part in the study. In total, each observer prepared 4 reports per case: one each for baseline and follow-up with both PACS and LMS. The interval between readings was at least 4 weeks, and cases were evaluated in random order.

PACS (Picture Archiving and Communication System)

Previous tumour measurements were not displayed, and current measurements were not stored within the images. Results of PACS-based assessments were documented on a standard handwritten EORTC (European Organization for Research and Treatment of Cancer) form. Patient and examination data were listed together with the 1D measurements of the target lesions, their slice position (z-orientation) and optional descriptive comments for clarification (e.g. liver metastasis, segment five). Lesions were assigned to anatomical categories as follows: 1 = primary tumour; 2 = lymph node; 3 = lung metastasis; 4 = liver metastasis; 7 = skin metastasis; 8 = other soft tissue metastasis; 9 = other metastasis. The sum of the longest diameters (LD) of the target lesions per case was calculated for baseline and follow-up examinations, as well as its change in %. Timing started after the reader had read the clinical report (or, at follow-up, the baseline report) and arranged the images for evaluation, and stopped on completion of the report.

LMS software (Lesion Management Solutions, Median Technologies, Valbonne Sophia Antipolis, France)

Before the study, each observer was introduced to LMS using five teaching cases. A separate database was provided for each reader, in which baseline tumour measurements and the slice positions of the target lesions were stored. Finally, an automatically generated quantitative report was created, showing the patient and examination data and summarizing the measured values and the sum LD. In follow-up reports, the calculated change in sum LD (%) was also provided. Furthermore, snapshots of the target lesions were shown. Timing started after the reader had read the clinical report (or, at follow-up, the baseline report) and arranged the images for evaluation, and stopped after the report was printed.

FIGURES 1A, B. Graphs show the agreement of measurements of all lesions evaluated with PACS and LMS. Absolute (A) and relative (B) differences between both measurements are plotted against the mean diameter of the lesions. The mean difference is shown by a continuous line. Dashed lines indicate the limits of 1.96 standard deviations from the mean. A total of 93.8% (384 of 409) of the values lie within 1.96 SDs of the mean (dashed lines).

Statistical analysis 

The size of the target lesions (diameter D) was recorded, and the sum LD was calculated for each observer at baseline and follow-up, for both PACS and LMS.

For the following calculations, the mean diameter ($D_{\mathrm{mean}}$) of each identical lesion was calculated as the reference, averaging the 1D measurements at baseline or follow-up from all readers and both software tools.

The accuracy of the 1D measurements of the target lesions was quantified for each observer at baseline or follow-up, for both PACS and LMS, as $\left[\Delta(D, D_{\mathrm{mean}}) / D_{\mathrm{mean}}\right] \times 100\ (\%)$, where $\Delta(D, D_{\mathrm{mean}})$ denotes the difference between a single measurement $D$ and $D_{\mathrm{mean}}$. The differences between measurements of the same lesions obtained with PACS and LMS were plotted against their mean using the Bland-Altman approach.
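The Bland-Altman comparison can be sketched in Python as follows. This is a minimal illustration, not the software used in the study: the function names and example diameters are hypothetical, the limits of agreement are taken as the mean difference ± 1.96 sample standard deviations (matching the dashed lines of Figure 1), and the relative variant assumes each difference is expressed as a percentage of the pair mean.

```python
from statistics import mean, stdev


def limits_of_agreement(diffs):
    """Return (bias, lower, upper, fraction_inside) for a list of paired differences.

    bias is the mean difference; the 95% limits of agreement are bias +/- 1.96 SD,
    where SD is the sample standard deviation of the differences.
    """
    if len(diffs) < 2:
        raise ValueError("need at least two paired differences")
    bias = mean(diffs)
    sd = stdev(diffs)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    inside = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return bias, lower, upper, inside


def bland_altman(pacs_mm, lms_mm, relative=False):
    """Bland-Altman statistics for the same lesions measured with two tools.

    Absolute differences are in mm (as in Figure 1A); with relative=True each
    difference is a percentage of the pair mean (an assumption for Figure 1B).
    """
    if len(pacs_mm) != len(lms_mm):
        raise ValueError("measurement lists must be paired")
    if relative:
        diffs = [100.0 * (p - l) / ((p + l) / 2) for p, l in zip(pacs_mm, lms_mm)]
    else:
        diffs = [p - l for p, l in zip(pacs_mm, lms_mm)]
    return limits_of_agreement(diffs)


# Hypothetical diameters (mm) of four lesions, each measured once with each tool.
pacs = [12.0, 25.0, 40.0, 55.0]
lms = [13.0, 24.0, 42.0, 55.0]
bias, lower, upper, inside = bland_altman(pacs, lms)
```

The fraction returned as `inside` corresponds to the share of points between the dashed lines reported in the Figure 1 caption.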

Intra-observer variability was assessed for each observer by comparing measurements of identical target lesions at baseline or follow-up obtained with both software tools, as $\left[\Delta(D_{\mathrm{PACS}}, D_{\mathrm{LMS}}) / D_{\mathrm{mean}}\right] \times 100\ (\%)$.

Accordingly, inter-observer variability was determined as the difference between measurements of identical target lesions for each pair of observers ($O_X$, $O_Y$) at baseline or follow-up, comparing the same imaging system (PACS vs. PACS or LMS vs. LMS) or different imaging systems (PACS vs. LMS or, vice versa, LMS vs. PACS), as $\left[\Delta(D_{O_X}, D_{O_Y}) / D_{\mathrm{mean}}\right] \times 100\ (\%)$.
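To make the three relative measures concrete, the following Python sketch computes them for a single lesion at one time point. The data layout (a dictionary keyed by observer and system), the example values and all function names are hypothetical. It assumes that Δ is an absolute difference (the 10th percentiles of 0 in Tables 4-6 suggest unsigned values) and that the accuracy reported in Table 3 equals 100% minus the relative deviation from $D_{\mathrm{mean}}$; both are interpretations rather than statements from the authors.

```python
from itertools import combinations
from statistics import mean

SYSTEMS = ("PACS", "LMS")


def relative_difference(d1, d2, d_mean):
    """|d1 - d2| / D_mean x 100 (%); taking the absolute value is an assumption."""
    return abs(d1 - d2) / d_mean * 100.0


def lesion_variability(measurements):
    """Variability measures for ONE lesion at ONE time point.

    measurements maps (observer, system) -> diameter in mm, with every observer
    measured in both systems. Returns D_mean, intra-observer values per observer,
    inter-observer values per observer pair and system combination, and the
    assumed accuracy (100% minus the relative deviation from D_mean).
    """
    d_mean = mean(list(measurements.values()))  # reference: all readers, both tools
    observers = sorted({obs for obs, _ in measurements})
    intra = {
        obs: relative_difference(measurements[(obs, "PACS")], measurements[(obs, "LMS")], d_mean)
        for obs in observers
    }
    inter = {}
    for ox, oy in combinations(observers, 2):  # pairs 1/2, 1/3, ..., 3/4
        for sx in SYSTEMS:
            for sy in SYSTEMS:  # PACS/PACS, PACS/LMS, LMS/PACS, LMS/LMS
                inter[(ox, oy, sx, sy)] = relative_difference(
                    measurements[(ox, sx)], measurements[(oy, sy)], d_mean
                )
    accuracy = {
        key: 100.0 - relative_difference(d, d_mean, d_mean)
        for key, d in measurements.items()
    }
    return d_mean, intra, inter, accuracy


# Hypothetical diameters (mm) of one lesion: four observers x two systems.
example = {
    (1, "PACS"): 20.0, (1, "LMS"): 20.0,
    (2, "PACS"): 22.0, (2, "LMS"): 21.0,
    (3, "PACS"): 19.0, (3, "LMS"): 20.0,
    (4, "PACS"): 18.0, (4, "LMS"): 20.0,
}
d_mean, intra, inter, accuracy = lesion_variability(example)
```

Per-lesion values like these would then be pooled over all lesions to obtain the medians and percentiles listed in Tables 3-6.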

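To assess the overall response, the change in sum LD was calculated as $\Delta\,\text{sum LD} = \dfrac{\text{sum LD}_{\text{follow-up}} - \text{sum LD}_{\text{baseline}}}{\text{sum LD}_{\text{baseline}}} \times 100\ (\%)$, so that negative values indicate tumour shrinkage.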
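As a short illustration of this formula (hypothetical function name; diameters in mm), the change in sum LD can be computed as shown below. The same function could also be applied to per-lesion $D_{\mathrm{mean}}$ values to obtain a summarized Δ sum LD.

```python
def delta_sum_ld(baseline_mm, follow_up_mm):
    """Percentage change of the sum of longest diameters from baseline to follow-up.

    Both lists must contain the same target lesions. Negative values indicate
    shrinkage, positive values growth.
    """
    base = sum(baseline_mm)
    if base <= 0:
        raise ValueError("baseline sum LD must be positive")
    return (sum(follow_up_mm) - base) * 100.0 / base


# Hypothetical case with three target lesions (mm): sum LD falls from 100 to 70 mm.
change = delta_sum_ld([30.0, 45.0, 25.0], [20.0, 30.0, 20.0])
```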

Additionally, a summarized Δ sum LD was calculated per case, pooling all evaluated target data ($D_{\mathrm{mean}}$) from both imaging systems and all observers.

The mean case evaluation time was calculated for each observer and for all observers combined, at baseline and follow-up, for both PACS and LMS.

Data are presented as mean, median, and 10th and 90th percentiles. Measurements were compared with a paired two-tailed Student's t-test. Cross-tabulated response categories were compared using the McNemar-Bowker test. A p-value <0.05 was considered statistically significant.
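The McNemar-Bowker test compares paired categorical ratings (for example, the response categories from two readings of the same case) by testing whether their cross-table is symmetric. A minimal, dependency-free Python sketch is shown below; it is not the statistics package used by the authors, and the example table is hypothetical. Off-diagonal cell pairs that are both zero are skipped and excluded from the degrees of freedom, which is one common convention (other software uses k(k-1)/2 degrees of freedom throughout). The chi-square tail probability uses the closed form available for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square distribution with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2-1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2/pi) exp(-x/2) sum_{j=1}^{(df-1)/2} x^(j-1/2) / (2j-1)!!
    term, acc = math.sqrt(x), 0.0
    for j in range(1, (df - 1) // 2 + 1):
        acc += term
        term *= x / (2 * j + 1)
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 / math.pi) * math.exp(-x / 2) * acc


def mcnemar_bowker(table):
    """Bowker's test of symmetry for a k x k table of paired ratings.

    Returns (statistic, degrees_of_freedom, p_value).
    """
    k = len(table)
    if any(len(row) != k for row in table):
        raise ValueError("table must be square")
    stat, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            n = table[i][j] + table[j][i]
            if n > 0:  # skip empty off-diagonal pairs
                stat += (table[i][j] - table[j][i]) ** 2 / n
                df += 1
    if df == 0:
        return 0.0, 0, 1.0  # no off-diagonal counts: complete agreement
    return stat, df, chi2_sf(stat, df)


# Hypothetical cross-table (rows: PACS, columns: LMS; categories PR, SD, PD).
example = [[10, 5, 0], [1, 10, 0], [0, 0, 10]]
statistic, dof, p_value = mcnemar_bowker(example)
```

With only the PR/SD cell pair discordant, this example gives a statistic of 16/6 on one degree of freedom (p of about 0.10).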

The study was carried out according to the 
Declaration of Helsinki. 



Results 

A total of 320 RECIST reports were prepared (4 observers × 20 cases × 2 evaluation time points × 2 software tools = 320).

As target lesions were not preselected, each observer independently identified up to five lesions per case. Five target lesions were selected in 44 case readings, 4 target lesions in 22, 3 target lesions in 39, and 2 target lesions in 55. No report was completed with a single target lesion.
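As a quick arithmetic check, the counts above can be combined into an overall mean number of target lesions per baseline reading. The sketch assumes that the four counts refer to the 160 baseline readings (4 observers × 20 cases × 2 software tools), which is consistent with their total; the result lies between the per-system means of 3.3 (PACS) and 3.4 (LMS) reported in this section.

```python
# Number of target lesions in a reading -> number of baseline readings with that count.
readings_by_target_count = {5: 44, 4: 22, 3: 39, 2: 55}

n_readings = sum(readings_by_target_count.values())  # 4 observers x 20 cases x 2 tools
n_targets = sum(k * n for k, n in readings_by_target_count.items())
mean_targets = n_targets / n_readings
print(f"{n_readings} readings, {n_targets} target measurements, mean {mean_targets:.2f}")
```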



Radiol Oncol 2012; 46(1): 8-18. 



Muenzel D et al. / Variability in RECIST-based response evaluation 



The mean number of target lesions was 3.3 using PACS and 3.4 using LMS.

Altogether, 120 different target lesions were defined. Of these, 21% were selected consistently by all four readers with both software tools, whereas 29% were selected by only one reader. A maximum of 10 different target lesions was observed in two patients, one with NSCLC and one with urothelial carcinoma, each with multiple metastases to the liver, lung and lymph nodes.

Measurements of all lesions evaluated with PACS and LMS at baseline and follow-up were compared. Figure 1 shows the Bland-Altman analysis of the differences between PACS and LMS measurements, plotted against the mean diameter measured by the two methods. The reproducibility of 1D measurements for all lesions was excellent: the mean difference in diameter measurements was -0.9 mm, with 95% limits of agreement ranging from -10 to 8.3 mm (Figure 1A). The mean relative difference was -2.9%, with 95% limits of agreement of -22.9% to 17.1% (Figure 1B).

Table 1 summarizes the mean target size (mm) and its range. In baseline reports, the smallest target lesion diameter was 10 mm, consistent with the RECIST guidelines. The largest target diameter at baseline was 132 mm, for a cohesive group of liver metastases. In follow-up examinations, target lesion sizes ranged between 5 mm and 152 mm.

The mean sum LD (mm) and its range are presented in Table 2, showing comparable ranges across observers and imaging systems.

The accuracy (%) of single 1D target measurements relative to $D_{\mathrm{mean}}$, together with the 10th and 90th percentiles, is documented in Table 3. A high mean accuracy of approximately 95% was found.

The intra- and inter-observer variabilities for target measurements are displayed in Tables 4, 5 and 6. The mean intra-observer variability was 5.0% at baseline and 6.8% at follow-up. The inter-observer variability was higher, with values between 6.0-7.2% at baseline and 6.7-9.1% at follow-up. Overall, inter-observer variability was significantly higher than intra-observer variability for baseline and follow-up examinations (p<0.01 and p<0.05, respectively). There were no statistically significant differences between the two imaging systems, PACS and LMS.
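The summary figures quoted in the abstract (mean intra-observer variability 5.9%, mean inter-observer variability 7.1%) can be reproduced from the mean rows of Tables 4-6, assuming they are unweighted averages of those table means; this is an interpretation, not a documented calculation.

```python
from statistics import mean

# Mean rows of Table 4 (intra-observer, PACS vs. LMS): baseline, follow-up.
intra_means = [5.0, 6.8]
# Mean rows of Tables 5 (baseline) and 6 (follow-up), inter-observer, in the order
# PACS vs. PACS, LMS vs. LMS, PACS vs. LMS, LMS vs. PACS.
inter_baseline = [6.5, 6.0, 7.2, 6.2]
inter_follow_up = [6.7, 8.2, 9.1, 7.1]

overall_intra = mean(intra_means)
overall_inter = mean(inter_baseline + inter_follow_up)
print(f"intra {overall_intra:.1f}%, inter {overall_inter:.1f}%")
```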

Figures 2 and 3 illustrate the variability of measurements in a lesion with well-defined edges (Figure 2) and in metastases with irregular contours (Figure 3).

Table 7 lists the maximum and minimum Δ sum LD (%) and the overall response in all 20 cases.




FIGURE 2. Tumour measurements of a well-marginated lymph node metastasis in a patient with renal cell carcinoma showed low mean inter-observer variabilities of 5.4% at baseline (A) and 5.1% at follow-up (B). Mean intra-observer variability was low, at 1.2% for both (A) and (B).




FIGURE 3. Poorly marginated, confluent liver lesions in a pa- 
tient with NSCLC. Mean inter-observer variability was 14.9% for 
baseline (A) and 10.3% for follow-up (B), respectively. Mean 
intra-observer variability was 16.8% (A) and 7.7% (B). 



Despite a mean difference of 24% between the maximum and minimum Δ sum LD, misclassifications occurred in only 10 of the 160 case assessments. There were no significant differences in response categorization between the two imaging systems (p = 0.513). A high concordance was also demonstrated with the summarized overall response, based on all assessed target lesions per case.
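The 24% figure and the misclassification count can be checked against Table 7 with a few lines of Python. The per-case values below are transcribed from that table; expressing the misclassifications as a share of all 160 readings is an added step for illustration.

```python
from statistics import mean

# Per-case difference between maximum and minimum Δ sum LD (%), cases 1-20 (Table 7).
differences = [23, 32, 31, 23, 17, 15, 6, 9, 60, 20, 12, 34, 16, 12, 23, 10, 10, 14, 53, 50]
# Misclassified readings per case (Table 7); each case has 4 observers x 2 tools = 8 readings.
misclassified = [2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4]

mean_difference = mean(differences)  # rounds to the reported 24%
n_readings = 8 * len(misclassified)
rate = 100 * sum(misclassified) / n_readings
print(f"mean difference {mean_difference}%, "
      f"{sum(misclassified)}/{n_readings} misclassified ({rate:.2f}%)")
```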

Table 8a-c shows the number of misclassifications of the overall response based on identical target lesions. Results of the overall tumour response assessment were compared for a single observer with all combinations of different observers (n=480) (a), for a single observer vs. the mean results of all observers (n=160) (b), and for different observers vs. the mean results of all observers (n=480) (c). The number of misclassified cases was lower when a case was assessed by a single observer and when the mean results of all observers were used. Apparently, averaging the results of all observers evens out outliers.
TABLE 1. 1D measurement of target lesions (mm) for each observer using PACS or LMS at baseline or follow-up

Evaluation   System   Observer   Mean target size (mm)   Maximum target size (mm)
Baseline     PACS     1          35.7                    125
                      2          40.9                    132
                      3          38.1                    117
                      4          38.8                    121
             LMS      1          36.7                    120
                      2          41.7                    126
                      3          37.9                    125
                      4          41.1                    126
Follow-up    PACS     1          34.9                    129
                      2          39.6                    136
                      3          37.3                    152
                      4          38.5                    131
             LMS      1          35.0                    129
                      2          40.9                    126
                      3          35.5                    133
                      4          40.8                    132


TABLE 2. Sum of the longest diameters of target lesions (mm) per case for each observer using PACS or LMS at baseline or follow-up

Evaluation   System   Observer   Mean sum LD (mm)   Minimum sum LD (mm)   Maximum sum LD (mm)
Baseline     PACS     1          118.3              35                    261
                      2          143.1              38                    330
                      3          120.1              34                    312
                      4          130.1              41                    310
             LMS      1          127.4              29                    296
                      2          139.6              40                    336
                      3          130.6              39                    305
                      4          129.9              37                    326
Follow-up    PACS     1          115.1              31                    310
                      2          136.5              34                    359
                      3          117.4              25                    326
                      4          129.0              33                    342
             LMS      1          121.7              29                    315
                      2          136.9              36                    356
                      3          122.6              34                    327
                      4          128.4              28                    359
TABLE 3. Accuracy of 1D measurements of target lesions in comparison to $D_{\mathrm{mean}}$ (%) for each observer using PACS or LMS at baseline or follow-up

Evaluation   System   Observer   Median (%)   10th percentile   90th percentile
Baseline     PACS     1          94.3         85.1              98.5
                      2          94.8         88.5              99.7
                      3          97.1         85.3              100
                      4          95.8         84.4              99.6
                      Mean       95.5
             LMS      1          95.7         85.9              100
                      2          95.3         84.8              99.9
                      3          96.3         83.7              99.2
                      4          95.7         84.6              98.9
                      Mean       95.7
Follow-up    PACS     1          93.9         79.2              99.4
                      2          96.8         84.4              99.3
                      3          96.6         84.1              100
                      4          96.0         78.3              100
                      Mean       95.8
             LMS      1          96.0         79.7              99.6
                      2          94.1         72.0              99.5
                      3          94.5         78.5              99.3
                      4          93.8         85.6              99.2
                      Mean       94.6



The mean time needed to prepare a baseline report was 286 s with PACS and 228 s with LMS software. At follow-up, the mean time for PACS reporting was 267 s versus 196 s with LMS (Table 9). Thus, LMS reduced the reporting time by 20.8% at baseline and 26.6% at follow-up (p<0.01).
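The relative time saving follows from (t_PACS - t_LMS) / t_PACS × 100. Applied to the rounded mean times, the short sketch below (hypothetical function name) gives about 26.6% at follow-up, matching the text, but about 20.3% at baseline rather than 20.8%; the reported baseline value may have been derived from unrounded or per-observer data, which this section does not show.

```python
def relative_time_saving(t_pacs_s, t_lms_s):
    """Percentage of reporting time saved with LMS relative to PACS."""
    return 100.0 * (t_pacs_s - t_lms_s) / t_pacs_s


baseline_saving = relative_time_saving(286, 228)   # rounded mean times (s)
follow_up_saving = relative_time_saving(267, 196)
print(f"baseline {baseline_saving:.1f}%, follow-up {follow_up_saving:.1f}%")
```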

Discussion 

In this study, we found low intra- and inter-observer variability for target lesion measurements according to the RECIST 1.1 guidelines. However, the high variability in the change of Δ sum LD shows the potential for misclassification of the overall response, although the reproducibility of RECIST reporting can be improved when a case is assessed by a single observer and when the mean results of multiple observers are used. The time required for the assessment and creation of a study report was decreased using custom software.

The assessment of tumour response in oncological clinical trials is usually based on serial measurements of the primary tumour and metastases using CT examinations before and during tumour therapy. For consistent evaluation of tumour response, concrete criteria for a standardized categorization of changes in tumour burden are necessary. 1D measurements for the calculation of tumour burden were introduced by Therasse et al.,4 and the revised RECIST guidelines (version 1.1) were published in 2009 with the intention of further simplifying and standardizing tumour response criteria.5 Among other changes, the number of target lesions was restricted to a maximum of 5, with a maximum of two lesions per organ. For target lesions, the longest diameter has to be measured, except for lymph nodes, which qualify as target lesions with a short axis ≥15 mm. To quantify tumour burden, the sum of the longest diameters of all target lesions is calculated. Similarly, for some rare tumours, e.g. malignant mesothelioma, for which modified RECIST criteria have been proposed, tumour thickness is measured perpendicular to the chest wall at two sites on three levels, and the sum of these measurements is calculated.6
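The target-lesion rules summarized above lend themselves to a simple consistency check. The Python sketch below is hypothetical (the `Lesion` record and the function names are not part of LMS or any other software mentioned here) and covers only the rules named in this paragraph, plus the 10 mm minimum for non-nodal targets mentioned in the Results; it also uses the RECIST 1.1 convention that a nodal target contributes its short axis to the sum. It is not a complete RECIST implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lesion:
    """Hypothetical target lesion record (sizes in mm)."""
    organ: str
    longest_diameter: float
    short_axis: Optional[float] = None  # used for lymph nodes
    is_lymph_node: bool = False


def is_measurable(lesion: Lesion) -> bool:
    """Nodes need a short axis >= 15 mm; other lesions a longest diameter >= 10 mm."""
    if lesion.is_lymph_node:
        return lesion.short_axis is not None and lesion.short_axis >= 15.0
    return lesion.longest_diameter >= 10.0


def sum_contribution(lesion: Lesion) -> float:
    """Value added to the sum: short axis for nodal targets, longest diameter otherwise."""
    if lesion.is_lymph_node:
        if lesion.short_axis is None:
            raise ValueError("nodal target without short axis")
        return lesion.short_axis
    return lesion.longest_diameter


def check_targets(lesions: List[Lesion], max_total: int = 5, max_per_organ: int = 2) -> List[str]:
    """Return a list of rule violations; an empty list means the selection passes."""
    problems = []
    if len(lesions) > max_total:
        problems.append(f"{len(lesions)} targets exceed the maximum of {max_total}")
    for organ, count in Counter(l.organ for l in lesions).items():
        if count > max_per_organ:
            problems.append(f"{count} targets in {organ} exceed {max_per_organ} per organ")
    for index, lesion in enumerate(lesions):
        if not is_measurable(lesion):
            problems.append(f"target {index} in {lesion.organ} is below the size threshold")
    return problems


# Hypothetical selection of four target lesions.
selection = [
    Lesion("liver", 32.0),
    Lesion("liver", 21.0),
    Lesion("lung", 12.0),
    Lesion("lymph node", 25.0, short_axis=16.0, is_lymph_node=True),
]
print(check_targets(selection))                     # []
print(sum(sum_contribution(l) for l in selection))  # 81.0
```

Adding a third liver lesion to this selection would produce one violation of the two-per-organ rule.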

In our study, only target lesions were evaluated in the tumour assessment reports, in order to facilitate comparison of the results of all four observers. Each observer individually defined target lesions from the complete CT examination without any study-dependent preselection, so the setting of our study closely resembled clinical trial reporting.

A high intra- and inter-observer concordance of RECIST-based quantification of tumour burden is essential for a valid assessment of response to anticancer therapy. Considering the agreement of measurements of identical lesions for each observer using PACS and LMS, intra-observer variability was low for all four observers, with a mean difference of 5.9%. The inter-observer variability was slightly higher than the intra-observer variability, with a mean of 7.1%. This is of special importance when different radiologists prepare the baseline and follow-up reports, as the RECIST guidelines do not require the same reader to perform tumour evaluations throughout an oncological trial.5 In contrast to our study, other studies evaluated the variability of tumour measurements using predefined single lesions. Erasmus et al. estimated mean intra- and inter-observer variabilities of 5.5% and 12.3%, respectively, for 1D measurements, including irregularly defined lesions.7 The lower discrepancies in our study might be due to a preferential selection of lesions with well-defined edges and the avoidance of irregularly shaped lesions, as suggested for target lesions by the RECIST guidelines.



TABLE 4. Intra-observer variability for PACS vs. LMS at baseline or follow-up

Evaluation   Observer   Median (%)   10th percentile   90th percentile
Baseline     1          4.9          0.0               15.0
             2          4.9          0.0               14.1
             3          5.0          0.0               17.2
             4          5.4          0.0               17.8
             Mean       5.0
Follow-up    1          5.2          0.0               18.9
             2          5.4          0.0               21.9
             3          6.9          0.9               29.2
             4          9.6          0.7               22.5
             Mean       6.8



Despite the variability of single measurements, the conclusive evaluation of treatment response is of special interest for therapeutic decisions in clinical trials.3,6 According to the RECIST guidelines, an increase in sum LD of at least 20% at follow-up indicates progressive disease (PD), a decrease of at least 30% is classified as PR, and changes in sum LD between -30% and +20% are classified as SD.5,6 In our study, the results of all observers showed excellent concordance in the estimation of tumour response; however, the mean difference in Δ sum LD was 24%. Therefore, cases with tumour growth or shrinkage close to the thresholds for PD and PR will be problematic, because in those cases the variability of single measurements may have a greater influence on the conclusion of the tumour response report. Furthermore, misclassification of the overall response was more frequent when different observers assessed the baseline and follow-up examinations, but it can be reduced when a case is assessed by a single reader and when the mean assessment of multiple readers is used.
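The thresholds quoted above translate into a simple classification rule. The Python sketch below (hypothetical function name) applies them to a percentage change in sum LD relative to baseline, as done in this study. It deliberately omits parts of full RECIST 1.1 not discussed here, such as complete response, non-target and new lesions, the use of the smallest sum on study as the reference for PD, and the additional 5 mm absolute increase required for PD.

```python
def target_response(delta_sum_ld_pct: float) -> str:
    """Classify target-lesion response from the % change in sum LD (baseline reference).

    >= +20% -> PD, <= -30% -> PR, otherwise SD (thresholds as stated in the text).
    """
    if delta_sum_ld_pct >= 20.0:
        return "PD"
    if delta_sum_ld_pct <= -30.0:
        return "PR"
    return "SD"


# Summarized Δ sum LD (%) of cases 1-20, transcribed from Table 7.
summarized = [-18, -4, -7, 14, -22, 2, 7, -41, -11, 7, -7, 11, 19, -22, 3, 6, -22, 11, -3, 18]
responses = [target_response(x) for x in summarized]
print(responses)
```

The resulting categories agree with the summarized overall response column of Table 7: SD in every case except case 8, which is classified as PR.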

A controversial issue is the minimum number of target lesions needed for a valid tumour evaluation.8-10 We confirmed a high accuracy of treatment response categorization with up to five target lesions according to RECIST 1.1, compared with conclusive results summarizing all lesions. This summarized sum LD evaluation of all defined targets closely approximated the RECIST 1.0 criteria, which allowed up to ten target lesions for tumour assessment. Darkeh et al. showed an increase in discrepancies in tumour response evaluation when fewer than four target lesions were defined for tumour measurements.8 In contrast, an evaluation of North Central Cancer Treatment Group trials found two target lesions to be sufficient for concordant results. Zacharia et al. likewise reported that measuring only one target lesion yielded the same tumour response classifications in patients with colon cancer metastases to the liver.10

Simple 1D measurements of target lesions were equivalent using PACS or LMS, so our study effectively provides repeated quantitative measurements of the same lesions. Nevertheless, for follow-up examinations the LMS software tool provides the previous 1D target measurements, marked by a line and stored in the images. This is advantageous for serial measurements in follow-up reports, especially if different observers assess tumour burden during anticancer treatment. Further investigations should examine whether such a software tool can decrease inter-observer variability when baseline and follow-up reports are prepared by different readers. Considering the time required for the complete target evaluation and the creation of a RECIST-based report of tumour burden, LMS software saves time, which might help persuade radiologists to prepare RECIST reports for every oncological patient.

TABLE 5. Inter-observer variability. Difference between baseline 1D measurements of target lesions by two observers using PACS and/or LMS, relative to mean tumour size (%)

System          Observer pair   Median (%)   10th percentile   90th percentile
PACS vs. PACS   1/2             9.9          0                 22.5
                1/3             5.6          0                 21.6
                1/4             6.2          0                 26.5
                2/3             6.8          0                 15.9
                2/4             6.2          1.8               18.2
                3/4             4.7          0                 22.3
                Mean            6.5
LMS vs. LMS     1/2             7.6          1.5               20.6
                1/3             6.1          0                 21.9
                1/4             5.6          2.1               23.2
                2/3             6.2          0                 22.1
                2/4             5.8          0                 22.6
                3/4             4.9          0                 24
                Mean            6.0
PACS vs. LMS    1/2             10.3         3                 22.3
                1/3             8.4          0                 21.9
                1/4             7.9          1.5               16.2
                2/3             5.7          0                 23.2
                2/4             5.0          2.2               25.9
                3/4             6.2          0                 22.3
                Mean            7.2
LMS vs. PACS    1/2             6.6          0.7               12
                1/3             4.8          0                 24
                1/4             5.8          0                 25.3
                2/3             7.9          0                 25.4
                2/4             6.2          0.3               25.7
                3/4             6.1          0                 27.9
                Mean            6.2

TABLE 6. Inter-observer variability. Difference between follow-up 1D measurements of target lesions by two observers using PACS and/or LMS, relative to mean tumour size (%)

System          Observer pair   Median (%)   10th percentile   90th percentile
PACS vs. PACS   1/2             9.3          0                 26.3
                1/3             7.9          1.3               27.1
                1/4             8.0          1.2               33.2
                2/3             5.6          0                 26.6
                2/4             4.3          0                 18.8
                3/4             4.9          0                 32.2
                Mean            6.7
LMS vs. LMS     1/2             7.6          0.4               42.9
                1/3             6.9          0                 30.7
                1/4             8.5          0                 21.0
                2/3             7.6          0                 42.9
                2/4             9.8          0                 45.2
                3/4             9.1          1.6               30.5
                Mean            8.2
PACS vs. LMS    1/2             10.8         1.9               41.5
                1/3             9.4          0                 31.3
                1/4             11.4         2.4               28.6
                2/3             6.0          0                 28.4
                2/4             7.8          1.1               17.1
                3/4             9.3          0                 28.4
                Mean            9.1
LMS vs. PACS    1/2             8.1          1.8               24.1
                1/3             6.5          2.4               27.0
                1/4             8.0          0.8               27.0
                2/3             5.9          0                 45.2
                2/4             5.4          0                 30.9
                3/4             8.4          0                 38.5
                Mean            7.1

TABLE 7. Tumour response per case (4 observers × 2 software tools × 20 cases = 160). Minimum and maximum Δ sum LD (%), their difference (%), the overall response, and the number of misclassifications are shown. Summarized Δ sum LD (%) and overall response were calculated based on $D_{\mathrm{mean}}$ of all target lesions per case, summarizing all observers and imaging systems. PR = partial response, SD = stable disease, PD = progressive disease; sum LD = sum of longest diameters.

Case   Min Δ sum LD (%)   Max Δ sum LD (%)   Difference (%)   PR   SD   PD   Misclassifications   Summarized Δ sum LD (%)   Summarized overall response
1      -36                -13                23               2    6    0    2                    -18                       SD
2      -18                14                 32               0    8    0    0                    -4                        SD
3      -17                14                 31               0    8    0    0                    -7                        SD
4      0                  23                 23               0    6    2    2                    14                        SD
5      -27                -10                17               0    8    0    0                    -22                       SD
6      -10                5                  15               0    8    0    0                    2                         SD
7      4                  10                 6                0    8    0    0                    7                         SD
8      -45                -36                9                8    0    0    0                    -41                       PR
9      -42                18                 60               1    7    0    1                    -11                       SD
10     -5                 15                 20               0    8    0    0                    7                         SD
11     -14                -2                 12               0    8    0    0                    -7                        SD
12     -14                20                 34               0    8    0    0                    11                        SD
13     4                  20                 16               0    8    0    0                    19                        SD
14     -28                -16                12               0    8    0    0                    -22                       SD
15     -7                 16                 23               0    8    0    0                    3                         SD
16     0                  10                 10               0    8    0    0                    6                         SD
17     -27                -17                10               0    8    0    0                    -22                       SD
18     5                  19                 14               0    8    0    0                    11                        SD
19     -18                35                 53               0    7    1    1                    -3                        SD
20     -8                 42                 50               0    4    4    4                    18                        SD
All cases: mean difference 24%; 160 readings in total; 10 misclassifications.

A limitation of our study was the disproportionately high frequency of the overall tumour response "stable disease". This is partly caused by the decision to assess only the development of target lesions, whereas non-target lesions and new lesions were not evaluated. It has been shown, for example, that in 60% of cases PD is based on the occurrence of new tumour lesions.11 Another explanation for the low frequency of PR is that only baseline and first follow-up examinations of patients with metastatic cancer were included in our study, whereas PR may occur later in the course of treatment. The potential time savings with LMS might have been higher, as the readers had been familiar with PACS for years, whereas their introduction to LMS was based on only five teaching cases.

TABLE 8. Mean case evaluation time including reporting using PACS or LMS at baseline or follow-up

Evaluation   Observer   Mean time PACS (s)   Mean time LMS (s)   p-value
Baseline     1          395                  310                 <0.01
             2          232                  212                 <0.01
             3          n/a                  n/a                 <0.05
             4          219                  216                 <0.01
             Mean       286                  228                 <0.01
Follow-up    1          377                  254                 <0.01
             2          252                  201                 <0.01
             3          219                  154                 <0.01
             4          222                  173                 <0.01
             Mean       267                  196                 <0.01

In the future, it will be of special interest to optimize the radiological evaluation of tumour burden and treatment response, with particular attention to new imaging techniques and further improvement of guidelines for tumour measurements.12,13 Future tumour response reports may provide volumetric tumour assessment and changes in tissue attenuation, leading to a more accurate and extended response evaluation. Volumetric measurement of pulmonary nodules is already feasible with numerous quantitative software tools and could be integrated into clinical routine.14,15 However, further improvement in the consistency of volumetric assessment of pulmonary nodules and low variability of semi-automated volume measurements will be required.14,16,17 For complete tumour assessment, semi-automated measurements of, e.g., liver lesions and lymph nodes are required and are currently work in progress; so far, only a few studies have tested their reproducibility and validity.18-21 Besides tumour shrinkage, a decrease in attenuation on contrast-enhanced CT indicates tumour response, especially in treatment with targeted therapies. Several studies reported an improvement of response evaluation after targeted therapy, e.g. in metastatic renal cell carcinoma and squamous cell carcinoma of the upper aerodigestive tract, when both changes in tumour size and attenuation were assessed.22-25 Furthermore, Stacchiotti et al. demonstrated that additional evaluation of tumour attenuation improved the prediction of tumour response in patients with high-grade soft-tissue sarcomas.26

Conclusions 

In our clinical study, we demonstrated low intra- and inter-observer variabilities for measurements of single target lesions, but the high variability in the change of Δ sum LD reveals the potential for misclassification of the overall response according to the RECIST guidelines. Nevertheless, the reproducibility of RECIST reporting can be improved when a case is assessed by a single reader and when the mean results of multiple readers are used. Custom software shortened case-based evaluation time, and further improvements of such tools may be valuable for therapy monitoring.